Sparse Causal Discovery 
in Multivariate Time Series 



Stefan Haufe* Guido Nolte 

q Berlin Institute of Technology Fraunhofer First, Berlin 

Klaus-Robert Muller Nicole Kramer 

Berlin Institute of Technology Berlin Institute of Technology 

i/-} January 15, 2009 

Abstract 

Our goal is to estimate causal interactions in multivariate time series. 
+J Using vector autoregressive (VAR) models, these can be denned based on 

non-vanishing coemcients belonging to respective time-lagged instances, 
r/j As in most cases a parsimonious causality structure is assumed, a promis- 

ing approach to causal discovery consists in fitting VAR models with an 
I additional sparsity-promoting regularization. Along this line we here pro- 

pose that sparsity should be enforced for the subgroups of coefficients 
that belong to each pair of time series, as the absence of a causal rela- 
te) tion requires the coefficients for all time-lags to become jointly zero. Such 
£N| behavior can be achieved by means of fi^-norm regularized regression, 
C^l for which an efficient active set solver has been proposed recently. Our 
* method is shown to outperform standard methods in recovering simulated 

causality graphs. The results are on par with a second novel approach 
which uses multiple statistical testing. 

Keywords Vector Autoregressive Model, Granger Causality, Group Lasso, 
iltiplc Testing 



Multiple Testing 
>< 

c5 1 Introduction 



Causality is commonly denned based on the widely accepted assumption that 
an effect is always preceded by its cause. Granger| ( |1969 1 postulates a measure 



of causal influence between two time series (Granger Causality) . In a nutshell, 
time series Zj Granger-causes time series Zj if knowledge of past values of zi 
improves the prediction of Zj (compared to only using past values of zj). In the 
case of a set F — {zi, . . . , zm} of time series, the pairwise analysis may lead to 
spurious detection of a causal relation. For this reason it is advisable to include 
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the set F \ {zi,Zj} of all additional observable time series in both prediction 
tasks. Note that this approach resolves the problem of common hidden factors 
if £ F . (If common factors are not observable, Granger causality fails and 
we refer to Nolte et al. (20081 for a detailed discussion and a remedy in form 



of the Phase Slope Index.) While this approach, to which we refer as complete 
Granger Causality, is practical, a more elegant way to deal with multivariate 
data is to handle all potential causal relations between all time series at once. In 
this paper, we assume a linear dynamics of the underlying system, which leads 
to the vector autoregressive (VAR) model. 

In many applications the true causality graph is assumed to be sparse, i.e. 
only a few causal interactions between time series are expected. However, both 
Ordinary Least Squares (OLS) and Ridge Regression, which are usually used for 
fitting VAR models, are known for producing dense coefficients. Only recently 
Valdes-Sosa et al. (2005) have proposed to enforce estimation of sparse AR 



coefficients using ^j-norm regularized models such as the Lasso (Tibshirani 



1996). 



In this paper we propose a novel sparse approach which - unlike Lasso - 
accounts for the fact that the absence of a causal relation between Zi and Zj 
requires all AR coefficients belonging to that certain pair of time series to be 
jointly zero. Furthermore, we consider Ridge Regression in combination with 
the multiple statistical testing procedure provided by 
More details on the methodology are given in Section 



These methods are 
evaluated and compared to standard approaches in extensive simulations. 



Hothorn et al. 



(2008). 



2 Background 

In this section, we briefly summarize related approaches to estimate sparse vec- 
tor autoregressive models in the context of causal discovery. We roughly distin- 
guish between sparse estimation methods and testing strategies. 

Given a multivariate time series z(t) £ K M , a linear vector autoregressive 
process of order P is defined as 



p 

z(t) = Y J Ai Mt-p)+^), a) 
P =i 

where £ U MxM , e ~ Af(0, a 2 1) and teZ indicates time. Hence, the signal 
at time t is modeled as a linear combination of its P past values and Gaussian 
measurement noise. Inspired by the initial assumption that the cause should 
always precede the effect, we suggest the following definition of causality in the 
case of vector autoregressive models. We say that time series z% has a causal 
influence on time series Zj if for at least one p £ {1,...,P}, the coefficient 
corresponding to the interaction between Zj and Zi at the pth time-lag is 
nonzero. 
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Thus, causal inference may be conducted by estimating the matrices A^ pS> 
from a sample Z = (z(l), . . . ,z(T)). Let us introduce the following shortcuts. 
We denote by A = (Ay> , . . . , A^ p ^ the matrix of all VAR coefficients and set 

X = {Z 1 ,...,Z P ), Y = Z a , Z p = (z(P+l-p),...,z(T-p)) T . Here vec(-) 
denotes the vectorization operation. 



2.1 Sparsity 



Probably the most straightforward way to estimate a sparse VAR is to use 
£i-regularization on the set of coefficients, 

ilasso 



.4 1 



argmin ||vec(A^4 
A 



rjll^ + AHvecCA)^ , A>0 



Recently, Valdes-Sosa et al. ( 2005 ) proposed a combination of VAR-estimation 



and the Lasso (Tibshirani 19961. While Valdes-Sosa et al. (2005) only consider 



a VAR model of order 1, there have been extensions to higher orders (e.g. Arnold 
et al. 2007). However, we note in the latter case, Lasso is not used on the VAR 



coefficients directly, but that the problem is transformed into the task of esti- 
mating partial correlation coefficients between time-lagged copies of the time 



series (see also lOpgen-Rhein and Strimmer 2007) 



2.2 Testing 



Just as in the case of sparse methods, it is often suggested to transform the 
regression task into the estimation of the matrix of partial correlation coeffi- 
cients between time-lagged copies of the time series. While Drton and Pcrlman 
( 2008 1 estimate the correlation matrix in an unregularized way, Opgen-Rhein 



and Strimmer ( 2007 ) propose a shrinkage estimator, which is superior in the case 
of high-dimensional data (Schafer and Strimmer 20051. Afterwards, significant 



partial correlations are detected by controlling false discovery rates. While the 
latter approach is only tested for P = 1, it is straightforward to extend it to 
higher order VAR's. 



3 Our Approach 

In the following, we provide the details regarding the groupwise sparsity and 
the alternative testing strategy respectively. 



3.1 Ridge Regression and Multiple Testing 

Under the assumption of Gaussian white noise it is natural to estimate the AR 
coefficients using regularized least squares, and probably the most straightfor- 
ward way to do so is to use Ridge Regression, 

^idge = argmin || vec (A04 - Y)\\*+\ ||vec(A)||* = (X T X+\I)~ 1 X T Y , A > . 

(2) 
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Thanks to the Ridge penalty, Eq. ^ delivers solutions with small coefficients, 
which, however, are in general never exactly zero. In the strict sense of Granger, 
this corresponds to a fully-connected dependency graph, rendering Ridge Re- 
gression an improper candidate for sparse causal recovery. On the other side, 
many of the estimated coefficients are expected to be non-significant. Hence, 
we propose a sparsification through statistical testing. In contrast to to e.g. 
bootstrapping, we derive p-values explicitly using the approximate distribution 
of the coefficients. 

It is apparant from Eq. ^ that the estimation can be done independently 
for each column of A, and so does the testing. Let therefore at^ denote the fcth 
column of A and let = (zk(P + 1), . . . , Zk(T)) T . Neglecting the dependency 
of X and Y, the Ridge coefficients depend linearly on Y, and we can conclude 
that under the null-hypothesis Hq : otk — 0, we have Sfc ~ jV(0,(T^E) with 

e = (x T x + xiy 1 x T x (x T x + xiy 1 . 

Furthermore, setting H = X (X T X + Xl) 1 X T an estimate of the model vari- 
ance a\ is given by 



I Yfe - Hy k 



trace ((/ - H){I - H T )) 



(3) 



Using Eq. (|3| we can now construct normalized test statistics an- = a^/ y cr^E;, 
which are jointly normally distributed with a ~ ftf(0, R) and Rij := E.y/ y/T,uT,jj. 
Suppose we want to test all individual hypotheses H 4 : = simulta- 



neously, then, according to Hothorn et al. ( |2008 1, the adjusted p-values are 



Pi = 1 — g (R, \aik\)- We reject a hypothesis, if the p-valuc is below the prede- 
fined significance level 7. Here, 

g{R,t) = P (max|a ife | < tj = J ...J 4>(ai, . . . , a M p)dai ■ ■ ■ da M p (4) 

and 4>(ct) is the density function of the multivariate normal distribution 7V(0, R). 



3.2 Group Lasso 

Sparse causal discovery using Ridge Regression is a two-step procedure and 
may possibly suffer from the aggregation of assumptions that enter in each 
step. Direct estimation of sparse VAR coefficients (e.g. via Lasso) is therefore 
desirable, as this would allow omission of the multiple significance testing step. 
However, for higher order models, this approach is prone to selecting a different 
set of causal interactions for each of the P time lags. We here suggest that this 
behavior can be overcome by enforcing joint sparsity of the coefficient vectors 
that belong to a certain pair of time scries. This corresponds to incorporating 
the prior belief that causal influences between time series are not restricted to 
only one particular time lag into the estimation. The positive effect of such 
modeling can be verified in Figure fl| (see Section PA for more details) . 
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The idea of imposing groupwise sparse coefficients leads to 2 -norm regu- 



which has also applications in Multiple Kernel Learning (Bach 



et al.| 


2008 


1, which has also 


et al.| 


2004 |Sonnenburg et al. 


Haufe et al. 2008 


I. The term 



2006 Meier 



2006| ) and the EEG/MEG inverse problem (e.g. 
\ 2-norm corresponds to an ^j-norm of a vector 



of ^2-norms. Our proposed objective is given by 



j^glasso _ 



argrnin ||vec(XA — F)|| 2 



s.t. 



L (1) 

l ll > • • 



t (P) 

M M , 



A 



(i) 



,-4 



(P) 



< K 



(5) 
(6) 



This penalty leads to a groupwise variable selection, i.e. a whole block of co- 
efficients is jointly zero. Note that the first term in Eq. ^ penalizes all MP 
coefficients describing univariate relations. In this way, those coefficients are 
shrunk and hence, overfitting is avoided. Furthermore, we remark that it is also 
conceivable to to split the estimation of A into M subproblems (as suggested in 
Subsection |3.1[ ), which is desirable in large-scale scenarios. 

Eqs. ([5]) and ^ define a non-differentiable but convex optimization problem 
which can be solved by means of Second-order Cone Programming (SOCP). 
For problems with sparse expected structure, however, the optimization can be 



carried out much more efficiently using the results of Roth and Fischer ( 2008 1 . 
By keeping a set of active coefficient groups, their algorithm needs to call the 
SOCP solver only for problem sizes far smaller than the original problem - 
leading to a considerable reduction of memory usage and computation time. In 



the experiments, we employ the active-set algorithm of Roth and Fischer (20081 



in combination with a freely available SOCP solver ( Sturm 1999 ) 



4 Simulations 

We conduct a series of experiments in which the causal structure of simulated 
data has to be recovered. We compare the Group Lasso, standard Lasso, Ridge 
Regression with multiple testing and complete Granger Causality based on AR 
models. All four approaches are applied both with and without knowledge of 
the true model order. In the latter case P = 10 is chosen for the reconstruction. 
For all methods considered, it is also possible to estimate the model order P, 
e.g. via cross-validation. 



4.1 Setup 

Each simulated data set consists of a multivariate time series with parameters 
M = 7 and T = 1000 that is generated by a random VAR process of order P = 5 
according to ([lj. The distribution of the noise component e(i) is chosen to be 
the standard normal distribution. The VAR coefficients for all but 10 randomly 
chosen pairs of time series are set to zero, yielding exactly 10 causal interactions. 
The non-zero coefficients are drawn randomly from Af(0, 0.04/). Each set of 
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VAR coefficients is tested for the stability of its induced dynamical system 
by looking at the eigenvalues of the corresponding transition matrix. Only 
coefficients leading to stable systems (i.e those with transition matrices with 
eigenvalues of at most 1) are accepted. We consider the following three types of 
problems, for each of which we create 10 instances: 1) no noise is added to the 
data generated by the VAR model 2) the data is superimposed by Gaussian noise 
of approximately the same strength, which is uncorrelated (white) both across 
time and sensors 3) the data is superimposed by mixed noise of approximately 
the same strength, which is generated as a random instantaneous mixture of M 
univariate AR processes of order 20. Note that in none of these cases the noise 
itself possesses a causal structure which would superimpose the true structure. 

For measuring performance we consider Receiver Operating Characteristics 
(ROC) curves, which allow objective assessment of the performance in different 
regimes (e.g. very few false positives). As an additional measure of absolute 
performance we calculate the Area Under Curve (AUC). ROC curves and AUC 
values are averaged across the 10 problem instances and standard errors are 
computed for AUC. 

Granger Causality is calculated using the Levinson-Wiggens-Robinson al- 
gorithm for fitting AR models (Marple 19871, which is available in the open 
source Biosig toolbox (Schlogl 20031. Note that for this particular method, we 
use Granger's original definition of causal influence instead of our coefficient- 
based approach. That is, for a pair of time series Zj and Zj we calculate the 
logarithm of the ratio of the residuals of the two AR models 1) including inter- 
actions and 2) excluding interactions between Zi and Zj (Granger score). This 
score is divided by its standard deviations as estimated by the jackknife. To 
obtain a ROC-curve, the Granger score is threshold at different values, ranging 
from completely sparse to completely dense solutions. 

For Ridge Regression, the regularization parameter A is chosen via 10-fold 
cross-validation (with respect to prediction accuracy). For this value of A, we 
derive the test statistics defined in Subsectior l3.ll The multidimensional inte- 
grals in Eq. Q are computed using Monte Carlo sampling according to Genz 
(19921. ROC-curves are constructed by varying the significance level 7. 

For Lasso and Group Lasso, solutions ranging from completely sparse to 
completely dense are obtained through variation of the regularizing constant A 
and k respectively. 



4.2 Results and Discussion 

First, we illustrate the different behavior of the investigated methods in Figure 
[T] This example corresponds to the situation without noise and with known 
model order P = 5. The left figure shows the true underlying causal structure, 
with a black box indicating a causal interaction. The reconstructions for the 
different methods are based on a single estimate of the VAR coefficients. For 
Granger causality, we use a threshold of 2. For Ridge Regression, we use a 
significance level of 7 = 0.05. For Lasso, Ridge Regression and Group Lasso, 
the regularizing constant is fixed by using 10-fold cross-validation (with respect 
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to prediction accuracy) . We display the binary influence matrix in Figure [T] In 
this example Ridge Regression exhibits perfect reconstruction and outperforms 
all other methods. Group Lasso comes second. Note that, due to the strong 
tendency of Group Lasso to select the same influences for each time lag, its 
estimated causal dependency matrix is sparser than that of Lasso. 



Ntf 



1 2 3 4 5 6 7 

TRUE 



1 2 3 4 5 6 7 

GRANGER 



1 2 3 4 5 6 7 

RIDGE 



1 2 3 4 5 6 7 

LASSO 




1 2 3 4 5 6 7 

GLASSO 



Figure 1 : Simulated causal influence matrix and estimates according to Granger 
Causality, Ridge Regression, Lasso and Group Lasso. 



Table [T] summarizes the AUC scores obtained in the experiments described 
above. The complementing ROC curves are shown in Figure [2] In short it can 
be stated that Group Lasso and Ridge Regression outperform their competitors 
in all scenarios, although not always significantly. While Ridge Regression per- 
forms slightly better than Group Lasso in the noiseless condition, Group Lasso 
has a clearly visible yet insignificant advantage over all methods in the white 
noise setting. Under the influence of mixed noise Ridge Regression and Group 
Lasso are on par. Note furthermore that the ROC curve for Lasso is below 
the ROC curve of Group Lasso, which shows that Lasso tends to be too dense. 
Interestingly, knowledge of the true model order hardly provided any significant 
advantage in our simulations. 









GRANGER 


RIDGE 


LASSO 


GLASSO 


p = 


5 


NO NOISE 
WHITE NOISE 
MIXED NOISE 


0.991 ± 0.004 
0.910 ± 0.023 
0.896 ± 0.012 


1.000 ± 0.000 

0.948 ± 0.020 
0.928 ± 0.010 


0.996 ± 0.002 
0.941 ± 0.021 
0.889 ± 0.011 


0.997 ± 0.002 
0.971 ± 0.016 
0.926 ± 0.012 


p = 


10 


NO NOISE 
WHITE NOISE 
MIXED NOISE 


0.980 ± 0.005 
0.885 ± 0.019 
0.893 ± 0.013 


0.998 ± 0.002 

0.958 ± 0.012 
0.931 ± 0.015 


0.996 ± 0.002 

0.948 ± 0.013 
0.861 ± 0.014 


0.999 ± 0.001 
0.979 ± 0.005 
0.931 ± 0.007 



Table 1: Average AUC scores and standard errors of Granger Causality, Ridge 
Regression, Lasso and Group Lasso in three different noise conditions and for two 
different model orders. Entries with significant superior score are highlighted. 



5 Conclusion 

We presented a novel approach for causal discovery in multivariate time series 
which is based on the Group Lasso. As an alternative we also discussed Ridge 
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NO NOISE WHITE NOISE MIXED NOISE 



Figure 2: Average ROC curves of Granger Causality (red), Ridge Regression 
(green), Lasso (blue) and Group Lasso (black) in three different noise conditions 
and for two different model orders. 



Regression with subsequent multiple testing according to Hothorn et al. 
which is also novel in the context of VAR modeling. 



(2008) 



Both approaches were 
shown to outperform standard methods in simulated scenarios. Future research 
will aim at applying our techniques to real-world problems. Given that the 
sparsity assumption is correct, our Group Lasso approach should be able to 
handle much larger problems than the ones that were considered here by 1) 
splitting the problem into M independent subproblems and 2) using the active 
set solver of Roth and Fischer ( 2008 1 in combination with strong regulariza- 
tion that ensures staying in the sparse regime. We expect that this will allow 
large-scale applications such as the estimation of cerebral information flow from 
functional Magnetic Resonance Tomography (fMRI) recordings to benefit from 
the improved accuracy of our approach. 
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